Cpu fused kernel #1804

base: main

Conversation
Signed-off-by: jiqing-feng <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Hi @matthewdouglas. All CI passed and I have rebased the PR. Please let me know what needs to be changed before merging. Thanks!
```python
if (
    not self.enable_optimized_cpu
    and x.device.type == "cpu"
    and has_avx512bf16()
    and not self.training
    and not x.requires_grad
):
    self.weight.data, quant_state = convert_weight_packed_for_cpu(self.weight.data, quant_state)
    self.enable_optimized_cpu = True
    quant_state.enable_optimized_cpu = True
```
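For context, a rough sketch of how this path would be exercised on an AVX512-BF16 machine. This is illustrative only: the constructor arguments and the exact point at which CPU quantization happens may differ from the PR's actual flow.

```python
import torch
import bitsandbytes as bnb

# Illustrative only: a 4-bit linear layer on CPU. Whether quantization happens
# at construction or on .to() may differ from the actual bitsandbytes flow.
layer = bnb.nn.Linear4bit(4096, 4096, quant_type="nf4").to("cpu")
layer.eval()  # not self.training, so the gating condition above can pass

x = torch.randn(1, 4096, dtype=torch.bfloat16)
with torch.inference_mode():  # x.requires_grad is False here
    _ = layer(x)  # first CPU forward repacks via convert_weight_packed_for_cpu
    _ = layer(x)  # subsequent calls use the fused kernel on the repacked weight
```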
matthewdouglas left a comment:

There are a couple of things I'm wondering about:

When we serialize from CPU after running through forward(), we probably still want to be compatible with other devices. I'm thinking that when serializing we want to undo this transformation if it's present.

Possibly an edge concern, but if we do a forward pass on CPU and then move to an accelerator, what would happen? I assume the weights would then be in the wrong order?

@SunMarc I would appreciate any feedback you might have on this part!
SunMarc replied:

For me, I prefer that we stick with only one packing format for serialization, and have all other hardware/kernels convert from this packing format at initialization or during the forward, as is done here. So we need a way to disable serialization, or to send a warning when someone tries to do that. This is probably something we can do in transformers, as I think most models are serialized from there. Also, instead of enable_optimized_cpu, maybe we can rename it packing_format?

> Possibly an edge concern, but if we do a forward pass on CPU and then move to an accelerator, what would happen? I assume the weights are then in the wrong order?

Either we re-convert the weights for CUDA (but this opens the door to many conversion functions between all the packing formats), or we just raise an error asking users to only run the model on one device.
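A minimal sketch of the "undo at serialization" idea. This is not the PR's actual code: revert_weight_packed_for_cpu is hypothetical, the assumed inverse of convert_weight_packed_for_cpu, and the hook name follows the standard PyTorch nn.Module API.

```python
import torch.nn as nn

class Linear4bitSketch(nn.Linear):
    """Sketch only: revert the CPU-specific repacking before serialization so
    the checkpoint keeps the single, device-agnostic packing format."""

    def _save_to_state_dict(self, destination, prefix, keep_vars):
        if getattr(self, "enable_optimized_cpu", False):
            # revert_weight_packed_for_cpu is hypothetical: the assumed
            # inverse of convert_weight_packed_for_cpu from this PR.
            self.weight.data, self.quant_state = revert_weight_packed_for_cpu(
                self.weight.data, self.quant_state
            )
            self.enable_optimized_cpu = False
            self.quant_state.enable_optimized_cpu = False
        # Fall through to the normal nn.Module serialization path.
        super()._save_to_state_dict(destination, prefix, keep_vars)
```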
SunMarc left a comment:

Left a comment!
Hi @matthewdouglas. bitsandbytes only loads one lib (one of cpu/cuda/xpu), which means we can only build one .so file. But we cannot build CPU and XPU together, because the CPU build relies on Intel OpenMP (libiomp5.so) while the XPU build relies on GNU OpenMP (libgomp.so); building them together raises an error. At the current stage we can only consider building one backend, so the CPU packing format will not be triggered on other backends. Even so, I added the reverse logic in case we want to support multiple backends in the future. cc @SunMarc
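As a sketch of what such reverse/guard logic could look like (hypothetical, not the PR's code), the forward path could either fail loudly or revert the packing when the input leaves the CPU:

```python
def _maybe_revert_cpu_packing(self, x):
    # Hypothetical guard: the weight was repacked for the CPU fused kernel,
    # but the input now lives on an accelerator.
    if getattr(self, "enable_optimized_cpu", False) and x.device.type != "cpu":
        # Option 1 (SunMarc's suggestion): raise, asking users to stay on one device.
        raise RuntimeError(
            "Weights were repacked for the CPU fused kernel; reload the model "
            "(or revert the packing) before running on a different device."
        )
        # Option 2: revert with a hypothetical inverse instead of raising, e.g.
        # self.weight.data, quant_state = revert_weight_packed_for_cpu(
        #     self.weight.data, quant_state
        # )
```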
The fused kernel speeds up 4-bit model inference by about 4x on TPOT (time per output token) compared to the dequant + matmul path. For the next optimization, targeting TTFT (time to first token), we need to bring in libxsmm.
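For reference, TPOT and TTFT can be measured roughly as below. This is a sketch only: the model id is a placeholder, and loading a 4-bit model on CPU assumes this PR's changes are in place.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any 4-bit-loadable model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.inference_mode():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)    # prefill + 1 token: TTFT proxy
    t1 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=65)   # prefill + 64 more decode steps
    t2 = time.perf_counter()

ttft = t1 - t0
tpot = ((t2 - t1) - ttft) / 64  # amortized per-token decode time
print(f"TTFT ~ {ttft:.3f}s, TPOT ~ {tpot * 1000:.1f} ms/token")
```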